The UCI ML Breast Cancer Wisconsin (Diagnostic) dataset contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe characteristics of cell nuclei present in the image. The target variable is binary, diagnosing whether the mass is malignant or benign.

Refer to the documentation for the load_breast_cancer function in the scikit-learn library for more information about this dataset and instructions on how to load it directly.

The objective of this project is to apply Principal Component Analysis (PCA) to reduce dimensionality and select the most relevant components, enhancing model performance while retaining essential information for accurate breast cancer classification.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2
In [2]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
data.target[[10, 50, 85]]
list(data.target_names)
Out[2]:
['malignant', 'benign']
In [3]:
print(data.DESCR)
.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 569

:Number of Attributes: 30 numeric, predictive attributes and the class

:Attribute Information:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

    The mean, standard error, and "worst" or largest (mean of the three
    worst/largest values) of these features were computed for each image,
    resulting in 30 features.  For instance, field 0 is Mean Radius, field
    10 is Radius SE, field 20 is Worst Radius.

    - class:
            - WDBC-Malignant
            - WDBC-Benign

:Summary Statistics:

===================================== ====== ======
                                        Min    Max
===================================== ====== ======
radius (mean):                        6.981  28.11
texture (mean):                       9.71   39.28
perimeter (mean):                     43.79  188.5
area (mean):                          143.5  2501.0
smoothness (mean):                    0.053  0.163
compactness (mean):                   0.019  0.345
concavity (mean):                     0.0    0.427
concave points (mean):                0.0    0.201
symmetry (mean):                      0.106  0.304
fractal dimension (mean):             0.05   0.097
radius (standard error):              0.112  2.873
texture (standard error):             0.36   4.885
perimeter (standard error):           0.757  21.98
area (standard error):                6.802  542.2
smoothness (standard error):          0.002  0.031
compactness (standard error):         0.002  0.135
concavity (standard error):           0.0    0.396
concave points (standard error):      0.0    0.053
symmetry (standard error):            0.008  0.079
fractal dimension (standard error):   0.001  0.03
radius (worst):                       7.93   36.04
texture (worst):                      12.02  49.54
perimeter (worst):                    50.41  251.2
area (worst):                         185.2  4254.0
smoothness (worst):                   0.071  0.223
compactness (worst):                  0.027  1.058
concavity (worst):                    0.0    1.252
concave points (worst):               0.0    0.291
symmetry (worst):                     0.156  0.664
fractal dimension (worst):            0.055  0.208
===================================== ====== ======

:Missing Attribute Values: None

:Class Distribution: 212 - Malignant, 357 - Benign

:Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

:Donor: Nick Street

:Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

|details-start|
**References**
|details-split|

- W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
  for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
  Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
  San Jose, CA, 1993.
- O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
  prognosis via linear programming. Operations Research, 43(4), pages 570-577,
  July-August 1995.
- W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
  to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
  163-171.

|details-end|

Exploratory data analysis (EDA)¶

In [4]:
import pandas as pd

df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

display(df.head().T)

print(f"This dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")
0 1 2 3 4
mean radius 17.990000 20.570000 19.690000 11.420000 20.290000
mean texture 10.380000 17.770000 21.250000 20.380000 14.340000
mean perimeter 122.800000 132.900000 130.000000 77.580000 135.100000
mean area 1001.000000 1326.000000 1203.000000 386.100000 1297.000000
mean smoothness 0.118400 0.084740 0.109600 0.142500 0.100300
mean compactness 0.277600 0.078640 0.159900 0.283900 0.132800
mean concavity 0.300100 0.086900 0.197400 0.241400 0.198000
mean concave points 0.147100 0.070170 0.127900 0.105200 0.104300
mean symmetry 0.241900 0.181200 0.206900 0.259700 0.180900
mean fractal dimension 0.078710 0.056670 0.059990 0.097440 0.058830
radius error 1.095000 0.543500 0.745600 0.495600 0.757200
texture error 0.905300 0.733900 0.786900 1.156000 0.781300
perimeter error 8.589000 3.398000 4.585000 3.445000 5.438000
area error 153.400000 74.080000 94.030000 27.230000 94.440000
smoothness error 0.006399 0.005225 0.006150 0.009110 0.011490
compactness error 0.049040 0.013080 0.040060 0.074580 0.024610
concavity error 0.053730 0.018600 0.038320 0.056610 0.056880
concave points error 0.015870 0.013400 0.020580 0.018670 0.018850
symmetry error 0.030030 0.013890 0.022500 0.059630 0.017560
fractal dimension error 0.006193 0.003532 0.004571 0.009208 0.005115
worst radius 25.380000 24.990000 23.570000 14.910000 22.540000
worst texture 17.330000 23.410000 25.530000 26.500000 16.670000
worst perimeter 184.600000 158.800000 152.500000 98.870000 152.200000
worst area 2019.000000 1956.000000 1709.000000 567.700000 1575.000000
worst smoothness 0.162200 0.123800 0.144400 0.209800 0.137400
worst compactness 0.665600 0.186600 0.424500 0.866300 0.205000
worst concavity 0.711900 0.241600 0.450400 0.686900 0.400000
worst concave points 0.265400 0.186000 0.243000 0.257500 0.162500
worst symmetry 0.460100 0.275000 0.361300 0.663800 0.236400
worst fractal dimension 0.118900 0.089020 0.087580 0.173000 0.076780
target 0.000000 0.000000 0.000000 0.000000 0.000000
This dataset contains 569 rows and 31 columns.
In [5]:
df.columns
Out[5]:
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'target'],
      dtype='object')

Check missing data¶

In [6]:
df.isnull().sum()
Out[6]:
mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64
  • The dataset has 30 numerical features and 1 target.

  • No missing values.

Check outliers and patterns¶
In [7]:
plt.figure(figsize=(13, 10))

sns.pairplot(df.drop('target', axis=1), 
             diag_kind='hist', corner=True, diag_kws={'color': 'grey'})

plt.yticks(rotation=0)
plt.show()
<Figure size 1300x1000 with 0 Axes>
  • We can see that some features have extreme values (outliers).

  • We can see some pairs have linear relationship.

In [8]:
df.hist(figsize=(12, 10), bins=30, edgecolor="black")
plt.subplots_adjust(hspace=0.7, wspace=0.4)
  • The target variable has 2 values: 0 and 1

  • Most of features have long-tail distribution.

In [9]:
# Response variable
y = df['target']

# Explanatory variables
X = df.drop('target', axis=1)

Target variable¶

In [10]:
plt.figure(figsize=(8, 6))
sns.countplot(x='target', data=df, palette='Set2')
plt.title('Bar Chart of Target Variable')
plt.xlabel('Target')
plt.ylabel('Count')
plt.xticks([0, 1], ['0', '1'])  # If your target variable has string labels, you can use this line to set the tick labels
plt.show()
/var/folders/kt/rg6_d5c90v16qpfm3hltsfb80000gn/T/ipykernel_2519/317489736.py:2: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.countplot(x='target', data=df, palette='Set2')
//anaconda3/lib/python3.11/site-packages/seaborn/_base.py:948: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)
//anaconda3/lib/python3.11/site-packages/seaborn/_base.py:948: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)

This is unbalance data >> we can improve by balancing the data

In [11]:
num_cols = X.columns

num_cols
Out[11]:
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')
In [12]:
# Box plots for outlier detection

ncols = 5

nrows = int(len(num_cols) / ncols) + (len(num_cols) % ncols > 0)

plt.figure(figsize=(20, nrows * 4))

for i, col in enumerate(num_cols, 1):
    plt.subplot(nrows, ncols, i)
    sns.boxplot(y=df[col])
    plt.title(col)
    plt.xticks([])  

plt.tight_layout()
plt.show()
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
  positions = grouped.grouper.result_index.to_numpy(dtype=float)
  • The box plots and histograms show that the majority of the numerical variables are skewed to the right, meaning that there are a few observations with very high values compared to the rest.
Compute the Mahalanobis distance and detect outliers¶
In [13]:
cov_matrix = np.cov(X, rowvar=False)

inv_cov_matrix = np.linalg.inv(cov_matrix)

mean_vector = X.mean(axis=0)

# Function to compute Mahalanobis distance for each row in the dataset
def mahalanobis_distance(row, mean_vector, inv_cov_matrix):
    diff = row - mean_vector
    md = np.sqrt(np.dot(np.dot(diff, inv_cov_matrix), diff.T))
    return md

X['mahalanobis_distance'] = X.apply(lambda row: mahalanobis_distance(row, mean_vector, inv_cov_matrix), axis=1)

threshold = chi2.ppf(0.99, df=X.shape[1] - 1)  # 99% confidence level

# Identify outliers
X['is_outlier'] = X['mahalanobis_distance'] > threshold

outliers = X[X['is_outlier']]
print("Number of outliers detected:", outliers.shape[0])
print(outliers)

# Visualize the Mahalanobis distance
plt.figure(figsize=(10, 6))
sns.histplot(X['mahalanobis_distance'], bins=30, kde=True)
plt.axvline(x=threshold, color='red', linestyle='--', label='Threshold')
plt.title('Distribution of Mahalanobis Distance')
plt.xlabel('Mahalanobis Distance')
plt.ylabel('Frequency')
plt.legend()
plt.show()
Number of outliers detected: 0
Empty DataFrame
Columns: [mean radius, mean texture, mean perimeter, mean area, mean smoothness, mean compactness, mean concavity, mean concave points, mean symmetry, mean fractal dimension, radius error, texture error, perimeter error, area error, smoothness error, compactness error, concavity error, concave points error, symmetry error, fractal dimension error, worst radius, worst texture, worst perimeter, worst area, worst smoothness, worst compactness, worst concavity, worst concave points, worst symmetry, worst fractal dimension, mahalanobis_distance, is_outlier]
Index: []

[0 rows x 32 columns]

Correlation matrix¶

In [14]:
plt.figure(figsize=(13, 10))

corr = X.corr() # correlation
mask = np.triu(np.ones_like(corr, dtype=bool)) # masking the upper triangle
np.fill_diagonal(mask, False) # diagonal 1s

# heatmap
sns.heatmap(corr, annot=True, cmap='Blues', mask=mask)

plt.yticks(rotation=0)
plt.show()
the strong positive correlations: corr > 0.8¶
In [15]:
strong_positive_corr = corr[(corr > 0.8) & (corr < 1)]

sns.heatmap(strong_positive_corr, annot=True, cmap='Blues', mask=mask)

plt.yticks(rotation=0)
plt.show()
In [16]:
print("Strong Positive Correlations (coefficient > 0.8):")
print(strong_positive_corr.dropna(how='all', axis=0).dropna(how='all', axis=1))
Strong Positive Correlations (coefficient > 0.8):
                         mean radius  mean texture  mean perimeter  mean area  \
mean radius                      NaN           NaN        0.997855   0.987357   
mean texture                     NaN           NaN             NaN        NaN   
mean perimeter              0.997855           NaN             NaN   0.986507   
mean area                   0.987357           NaN        0.986507        NaN   
mean smoothness                  NaN           NaN             NaN        NaN   
mean compactness                 NaN           NaN             NaN        NaN   
mean concavity                   NaN           NaN             NaN        NaN   
mean concave points         0.822529           NaN        0.850977   0.823269   
radius error                     NaN           NaN             NaN        NaN   
perimeter error                  NaN           NaN             NaN        NaN   
area error                       NaN           NaN             NaN   0.800086   
compactness error                NaN           NaN             NaN        NaN   
concavity error                  NaN           NaN             NaN        NaN   
fractal dimension error          NaN           NaN             NaN        NaN   
worst radius                0.969539           NaN        0.969476   0.962746   
worst texture                    NaN      0.912045             NaN        NaN   
worst perimeter             0.965137           NaN        0.970387   0.959120   
worst area                  0.941082           NaN        0.941550   0.959213   
worst smoothness                 NaN           NaN             NaN        NaN   
worst compactness                NaN           NaN             NaN        NaN   
worst concavity                  NaN           NaN             NaN        NaN   
worst concave points             NaN           NaN             NaN        NaN   
worst fractal dimension          NaN           NaN             NaN        NaN   

                         mean smoothness  mean compactness  mean concavity  \
mean radius                          NaN               NaN             NaN   
mean texture                         NaN               NaN             NaN   
mean perimeter                       NaN               NaN             NaN   
mean area                            NaN               NaN             NaN   
mean smoothness                      NaN               NaN             NaN   
mean compactness                     NaN               NaN        0.883121   
mean concavity                       NaN          0.883121             NaN   
mean concave points                  NaN          0.831135        0.921391   
radius error                         NaN               NaN             NaN   
perimeter error                      NaN               NaN             NaN   
area error                           NaN               NaN             NaN   
compactness error                    NaN               NaN             NaN   
concavity error                      NaN               NaN             NaN   
fractal dimension error              NaN               NaN             NaN   
worst radius                         NaN               NaN             NaN   
worst texture                        NaN               NaN             NaN   
worst perimeter                      NaN               NaN             NaN   
worst area                           NaN               NaN             NaN   
worst smoothness                0.805324               NaN             NaN   
worst compactness                    NaN          0.865809             NaN   
worst concavity                      NaN          0.816275        0.884103   
worst concave points                 NaN          0.815573        0.861323   
worst fractal dimension              NaN               NaN             NaN   

                         mean concave points  radius error  perimeter error  \
mean radius                         0.822529           NaN              NaN   
mean texture                             NaN           NaN              NaN   
mean perimeter                      0.850977           NaN              NaN   
mean area                           0.823269           NaN              NaN   
mean smoothness                          NaN           NaN              NaN   
mean compactness                    0.831135           NaN              NaN   
mean concavity                      0.921391           NaN              NaN   
mean concave points                      NaN           NaN              NaN   
radius error                             NaN           NaN         0.972794   
perimeter error                          NaN      0.972794              NaN   
area error                               NaN      0.951830         0.937655   
compactness error                        NaN           NaN              NaN   
concavity error                          NaN           NaN              NaN   
fractal dimension error                  NaN           NaN              NaN   
worst radius                        0.830318           NaN              NaN   
worst texture                            NaN           NaN              NaN   
worst perimeter                     0.855923           NaN              NaN   
worst area                          0.809630           NaN              NaN   
worst smoothness                         NaN           NaN              NaN   
worst compactness                        NaN           NaN              NaN   
worst concavity                          NaN           NaN              NaN   
worst concave points                0.910155           NaN              NaN   
worst fractal dimension                  NaN           NaN              NaN   

                         ...  fractal dimension error  worst radius  \
mean radius              ...                      NaN      0.969539   
mean texture             ...                      NaN           NaN   
mean perimeter           ...                      NaN      0.969476   
mean area                ...                      NaN      0.962746   
mean smoothness          ...                      NaN           NaN   
mean compactness         ...                      NaN           NaN   
mean concavity           ...                      NaN           NaN   
mean concave points      ...                      NaN      0.830318   
radius error             ...                      NaN           NaN   
perimeter error          ...                      NaN           NaN   
area error               ...                      NaN           NaN   
compactness error        ...                 0.803269           NaN   
concavity error          ...                      NaN           NaN   
fractal dimension error  ...                      NaN           NaN   
worst radius             ...                      NaN           NaN   
worst texture            ...                      NaN           NaN   
worst perimeter          ...                      NaN      0.993708   
worst area               ...                      NaN      0.984015   
worst smoothness         ...                      NaN           NaN   
worst compactness        ...                      NaN           NaN   
worst concavity          ...                      NaN           NaN   
worst concave points     ...                      NaN           NaN   
worst fractal dimension  ...                      NaN           NaN   

                         worst texture  worst perimeter  worst area  \
mean radius                        NaN         0.965137    0.941082   
mean texture                  0.912045              NaN         NaN   
mean perimeter                     NaN         0.970387    0.941550   
mean area                          NaN         0.959120    0.959213   
mean smoothness                    NaN              NaN         NaN   
mean compactness                   NaN              NaN         NaN   
mean concavity                     NaN              NaN         NaN   
mean concave points                NaN         0.855923    0.809630   
radius error                       NaN              NaN         NaN   
perimeter error                    NaN              NaN         NaN   
area error                         NaN              NaN    0.811408   
compactness error                  NaN              NaN         NaN   
concavity error                    NaN              NaN         NaN   
fractal dimension error            NaN              NaN         NaN   
worst radius                       NaN         0.993708    0.984015   
worst texture                      NaN              NaN         NaN   
worst perimeter                    NaN              NaN    0.977578   
worst area                         NaN         0.977578         NaN   
worst smoothness                   NaN              NaN         NaN   
worst compactness                  NaN              NaN         NaN   
worst concavity                    NaN              NaN         NaN   
worst concave points               NaN         0.816322         NaN   
worst fractal dimension            NaN              NaN         NaN   

                         worst smoothness  worst compactness  worst concavity  \
mean radius                           NaN                NaN              NaN   
mean texture                          NaN                NaN              NaN   
mean perimeter                        NaN                NaN              NaN   
mean area                             NaN                NaN              NaN   
mean smoothness                  0.805324                NaN              NaN   
mean compactness                      NaN           0.865809         0.816275   
mean concavity                        NaN                NaN         0.884103   
mean concave points                   NaN                NaN              NaN   
radius error                          NaN                NaN              NaN   
perimeter error                       NaN                NaN              NaN   
area error                            NaN                NaN              NaN   
compactness error                     NaN                NaN              NaN   
concavity error                       NaN                NaN              NaN   
fractal dimension error               NaN                NaN              NaN   
worst radius                          NaN                NaN              NaN   
worst texture                         NaN                NaN              NaN   
worst perimeter                       NaN                NaN              NaN   
worst area                            NaN                NaN              NaN   
worst smoothness                      NaN                NaN              NaN   
worst compactness                     NaN                NaN         0.892261   
worst concavity                       NaN           0.892261              NaN   
worst concave points                  NaN           0.801080         0.855434   
worst fractal dimension               NaN           0.810455              NaN   

                         worst concave points  worst fractal dimension  
mean radius                               NaN                      NaN  
mean texture                              NaN                      NaN  
mean perimeter                            NaN                      NaN  
mean area                                 NaN                      NaN  
mean smoothness                           NaN                      NaN  
mean compactness                     0.815573                      NaN  
mean concavity                       0.861323                      NaN  
mean concave points                  0.910155                      NaN  
radius error                              NaN                      NaN  
perimeter error                           NaN                      NaN  
area error                                NaN                      NaN  
compactness error                         NaN                      NaN  
concavity error                           NaN                      NaN  
fractal dimension error                   NaN                      NaN  
worst radius                              NaN                      NaN  
worst texture                             NaN                      NaN  
worst perimeter                      0.816322                      NaN  
worst area                                NaN                      NaN  
worst smoothness                          NaN                      NaN  
worst compactness                    0.801080                 0.810455  
worst concavity                      0.855434                      NaN  
worst concave points                      NaN                      NaN  
worst fractal dimension                   NaN                      NaN  

[23 rows x 23 columns]
the strong negative correlations: corr < -0.8¶
In [17]:
strong_negative_corr = corr[corr < -0.8]

sns.heatmap(strong_negative_corr, annot=True, cmap='Blues', mask=mask)

plt.yticks(rotation=0)
plt.show()
//anaconda3/lib/python3.11/site-packages/seaborn/matrix.py:202: RuntimeWarning: All-NaN slice encountered
  vmin = np.nanmin(calc_data)
//anaconda3/lib/python3.11/site-packages/seaborn/matrix.py:207: RuntimeWarning: All-NaN slice encountered
  vmax = np.nanmax(calc_data)
In [18]:
print("Strong Negative Correlations (coefficient < -0.8):")
print(strong_negative_corr.dropna(how='all', axis=0).dropna(how='all', axis=1))
Strong Negative Correlations (coefficient < -0.8):
Empty DataFrame
Columns: []
Index: []

Applying PCA to select components¶

Utilize Scree Plot, Cumulative Variance (targeting at least 80% of total variance), and Kaiser's Criterion to determine the optimal number of principle components.

In [19]:
X.drop(['mahalanobis_distance', 'is_outlier'], axis =1, inplace = True)
In [20]:
X.head()
Out[20]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 30 columns

In [21]:
X.shape
Out[21]:
(569, 30)

Principle Component Analysis¶

In [22]:
# standardized data
sc = StandardScaler()
sc_X = sc.fit_transform(X)

sc_X = pd.DataFrame(sc_X, columns = X.columns)
sc_X.head()
Out[22]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 1.097064 -2.073335 1.269934 0.984375 1.568466 3.283515 2.652874 2.532475 2.217515 2.255747 ... 1.886690 -1.359293 2.303601 2.001237 1.307686 2.616665 2.109526 2.296076 2.750622 1.937015
1 1.829821 -0.353632 1.685955 1.908708 -0.826962 -0.487072 -0.023846 0.548144 0.001392 -0.868652 ... 1.805927 -0.369203 1.535126 1.890489 -0.375612 -0.430444 -0.146749 1.087084 -0.243890 0.281190
2 1.579888 0.456187 1.566503 1.558884 0.942210 1.052926 1.363478 2.037231 0.939685 -0.398008 ... 1.511870 -0.023974 1.347475 1.456285 0.527407 1.082932 0.854974 1.955000 1.152255 0.201391
3 -0.768909 0.253732 -0.592687 -0.764464 3.283553 3.402909 1.915897 1.451707 2.867383 4.910919 ... -0.281464 0.133984 -0.249939 -0.550021 3.394275 3.893397 1.989588 2.175786 6.046041 4.935010
4 1.750297 -1.151816 1.776573 1.826229 0.280372 0.539340 1.371011 1.428493 -0.009560 -0.562450 ... 1.298575 -1.466770 1.338539 1.220724 0.220556 -0.313395 0.613179 0.729259 -0.868353 -0.397100

5 rows × 30 columns

In [23]:
# PCA
pca = PCA()
pcs = pca.fit_transform(sc_X)

pcs_df = pd.DataFrame(pcs, columns=[f'PC{i+1}' for i in range(pcs.shape[1])])
pcs_df.head()
Out[23]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 ... PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28 PC29 PC30
0 9.192837 1.948583 -1.123166 3.633731 -1.195110 1.411424 2.159370 -0.398407 -0.157118 -0.877402 ... 0.096515 0.068850 0.084519 -0.175256 -0.151020 -0.201503 -0.252585 -0.033914 0.045648 -0.047169
1 2.387802 -3.768172 -0.529293 1.118264 0.621775 0.028656 0.013358 0.240988 -0.711905 1.106995 ... -0.077327 -0.094578 -0.217718 0.011290 -0.170510 -0.041129 0.181270 0.032624 -0.005687 -0.001868
2 5.733896 -1.075174 -0.551748 0.912083 -0.177086 0.541452 -0.668166 0.097374 0.024066 0.454275 ... 0.311067 -0.060309 -0.074291 0.102762 0.171158 0.004735 0.049569 0.047026 0.003146 0.000751
3 7.122953 10.275589 -3.232790 0.152547 -2.960878 3.053422 1.429911 1.059565 -1.405440 -1.116975 ... 0.434193 -0.203266 -0.124105 0.153430 0.077496 -0.275225 0.183462 0.042484 -0.069295 -0.019937
4 3.935302 -1.948072 1.389767 2.940639 0.546747 -1.226495 -0.936213 0.636376 -0.263805 0.377704 ... -0.116545 -0.017650 0.139454 -0.005332 0.003062 0.039254 0.032168 -0.034786 0.005038 0.021214

5 rows × 30 columns

In [24]:
# Scree plot
var_pct = np.round(pca.explained_variance_ratio_ * 100, decimals = 1)

plt.figure(figsize=(10, 6))
plt.bar(range(1, len(var_pct) + 1), var_pct, alpha=0.8)
plt.plot(range(1, len(var_pct) + 1), var_pct, color='k', linestyle=':', marker='o')

plt.xticks(range(1, len(var_pct) + 1))
plt.xlabel('Principle Components')
plt.ylabel('Explained Variance (%)')

plt.show()
In [25]:
pca1 = PCA(0.8)
pca_result1 = pca1.fit_transform(sc_X)

pca1.n_components_ 
Out[25]:
5
In [26]:
##### Kaiser criteria
In [27]:
eigenvalues = pca.explained_variance_

kaiser = sum(eigenvalues > 1)

# Plot eigenvalues
plt.bar(range(1, len(eigenvalues) + 1), eigenvalues, alpha=0.7)
plt.axhline(y=1, color='k', linestyle=':', marker='o', label='Eigenvalue = 1 (Kaiser Criterion)')
plt.xticks(range(1, len(eigenvalues) + 1))
plt.ylabel("Eigenvalues")
plt.xlabel("Component #")
plt.legend()
plt.title("Scree Plot")
plt.show()

explained_variance_ratio = eigenvalues / np.sum(eigenvalues)
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

components_needed = np.argmax(cumulative_variance_ratio >= 0.8) + 1

print("Number of principal components needed to explain at least 80% of the total variance:", components_needed)
Number of principal components needed to explain at least 80% of the total variance: 5

Answer:

After Utilizing Scree Plot, Cumulative Variance (targeting at least 80% of total variance), and Kaiser's Criterion, the optimal number of principle components is 5.

In [28]:
# loadings
loadings = pd.DataFrame(pca.components_.T[:, :5], columns = ['PC1', 'PC2', 'PC3', 'PC4', 'PC5'], index = X.columns)
loadings
Out[28]:
PC1 PC2 PC3 PC4 PC5
mean radius 0.218902 -0.233857 -0.008531 0.041409 0.037786
mean texture 0.103725 -0.059706 0.064550 -0.603050 -0.049469
mean perimeter 0.227537 -0.215181 -0.009314 0.041983 0.037375
mean area 0.220995 -0.231077 0.028700 0.053434 0.010331
mean smoothness 0.142590 0.186113 -0.104292 0.159383 -0.365089
mean compactness 0.239285 0.151892 -0.074092 0.031795 0.011704
mean concavity 0.258400 0.060165 0.002734 0.019123 0.086375
mean concave points 0.260854 -0.034768 -0.025564 0.065336 -0.043861
mean symmetry 0.138167 0.190349 -0.040240 0.067125 -0.305941
mean fractal dimension 0.064363 0.366575 -0.022574 0.048587 -0.044424
radius error 0.205979 -0.105552 0.268481 0.097941 -0.154456
texture error 0.017428 0.089980 0.374634 -0.359856 -0.191651
perimeter error 0.211326 -0.089457 0.266645 0.088992 -0.120990
area error 0.202870 -0.152293 0.216007 0.108205 -0.127574
smoothness error 0.014531 0.204430 0.308839 0.044664 -0.232066
compactness error 0.170393 0.232716 0.154780 -0.027469 0.279968
concavity error 0.153590 0.197207 0.176464 0.001317 0.353982
concave points error 0.183417 0.130322 0.224658 0.074067 0.195548
symmetry error 0.042498 0.183848 0.288584 0.044073 -0.252869
fractal dimension error 0.102568 0.280092 0.211504 0.015305 0.263297
worst radius 0.227997 -0.219866 -0.047507 0.015417 -0.004407
worst texture 0.104469 -0.045467 -0.042298 -0.632808 -0.092883
worst perimeter 0.236640 -0.199878 -0.048547 0.013803 0.007454
worst area 0.224871 -0.219352 -0.011902 0.025895 -0.027391
worst smoothness 0.127953 0.172304 -0.259798 0.017652 -0.324435
worst compactness 0.210096 0.143593 -0.236076 -0.091328 0.121804
worst concavity 0.228768 0.097964 -0.173057 -0.073951 0.188519
worst concave points 0.250886 -0.008257 -0.170344 0.006007 0.043332
worst symmetry 0.122905 0.141883 -0.271313 -0.036251 -0.244559
worst fractal dimension 0.131784 0.275339 -0.232791 -0.077053 0.094423
In [29]:
cumulative_variance_ratio = np.sum(pca.explained_variance_ratio_[:2])
print("Cumulative explained variance ratio up to 2 components:", cumulative_variance_ratio)
Cumulative explained variance ratio up to 2 components: 0.6324320765155949
In [30]:
print("Explained Variance Ratio for PC1 and PC2:", explained_variance_ratio[:2])
Explained Variance Ratio for PC1 and PC2: [0.44272026 0.18971182]
In [31]:
pca1 = PCA(n_components=2)
principalComponents = pca1.fit_transform(sc_X)
principalDf = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2'])
In [32]:
principalDf.head(5)
Out[32]:
principal component 1 principal component 2
0 9.192837 1.948583
1 2.387802 -3.768172
2 5.733896 -1.075174
3 7.122953 10.275589
4 3.935302 -1.948072
In [33]:
finalDf = pd.concat([principalDf, df[['target']]], axis = 1)
finalDf.head(5)
Out[33]:
principal component 1 principal component 2 target
0 9.192837 1.948583 0
1 2.387802 -3.768172 0
2 5.733896 -1.075174 0
3 7.122953 10.275589 0
4 3.935302 -1.948072 0
In [34]:
sns.lmplot(x='principal component 1',y='principal component 2',
           data=finalDf, hue = 'target' ,fit_reg=False,
          height=6, aspect=1)
Out[34]:
<seaborn.axisgrid.FacetGrid at 0x155972310>

The plot shows that just some values are overlapping between cancer and no-cancer patients.

Obtain the eigen vector loadings of PC1 and PC2, then interpret the loadings for the 2 most important features for PC1 and for PC2.¶

In [35]:
loadings_pc1 = loadings['PC1'].sort_values(ascending=False)

loadings_pc1
Out[35]:
mean concave points        0.260854
mean concavity             0.258400
worst concave points       0.250886
mean compactness           0.239285
worst perimeter            0.236640
worst concavity            0.228768
worst radius               0.227997
mean perimeter             0.227537
worst area                 0.224871
mean area                  0.220995
mean radius                0.218902
perimeter error            0.211326
worst compactness          0.210096
radius error               0.205979
area error                 0.202870
concave points error       0.183417
compactness error          0.170393
concavity error            0.153590
mean smoothness            0.142590
mean symmetry              0.138167
worst fractal dimension    0.131784
worst smoothness           0.127953
worst symmetry             0.122905
worst texture              0.104469
mean texture               0.103725
fractal dimension error    0.102568
mean fractal dimension     0.064363
symmetry error             0.042498
texture error              0.017428
smoothness error           0.014531
Name: PC1, dtype: float64
In [36]:
loadings_pc2 = loadings['PC2'].sort_values(ascending=False)

loadings_pc2
Out[36]:
mean fractal dimension     0.366575
fractal dimension error    0.280092
worst fractal dimension    0.275339
compactness error          0.232716
smoothness error           0.204430
concavity error            0.197207
mean symmetry              0.190349
mean smoothness            0.186113
symmetry error             0.183848
worst smoothness           0.172304
mean compactness           0.151892
worst compactness          0.143593
worst symmetry             0.141883
concave points error       0.130322
worst concavity            0.097964
texture error              0.089980
mean concavity             0.060165
worst concave points      -0.008257
mean concave points       -0.034768
worst texture             -0.045467
mean texture              -0.059706
perimeter error           -0.089457
radius error              -0.105552
area error                -0.152293
worst perimeter           -0.199878
mean perimeter            -0.215181
worst area                -0.219352
worst radius              -0.219866
mean area                 -0.231077
mean radius               -0.233857
Name: PC2, dtype: float64
In [37]:
top_features_pc1 = loadings_pc1.head(2)
print("Top 2 features for PC1:")
print(top_features_pc1)

top_features_pc2 = loadings_pc2.head(2)
print("\nTop 2 features for PC2:")
print(top_features_pc2)
Top 2 features for PC1:
mean concave points    0.260854
mean concavity         0.258400
Name: PC1, dtype: float64

Top 2 features for PC2:
mean fractal dimension     0.366575
fractal dimension error    0.280092
Name: PC2, dtype: float64

Construct a logistic regression model using both the complete set of features and solely the first two principal components.¶

Evaluate the model performances based on accuracy, AUC score, and the classification report. Use a 80/20 split for the train and test set.¶

Modeling with Logistic Regression¶

In [38]:
# the complete set of features 
sc_X
Out[38]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 1.097064 -2.073335 1.269934 0.984375 1.568466 3.283515 2.652874 2.532475 2.217515 2.255747 ... 1.886690 -1.359293 2.303601 2.001237 1.307686 2.616665 2.109526 2.296076 2.750622 1.937015
1 1.829821 -0.353632 1.685955 1.908708 -0.826962 -0.487072 -0.023846 0.548144 0.001392 -0.868652 ... 1.805927 -0.369203 1.535126 1.890489 -0.375612 -0.430444 -0.146749 1.087084 -0.243890 0.281190
2 1.579888 0.456187 1.566503 1.558884 0.942210 1.052926 1.363478 2.037231 0.939685 -0.398008 ... 1.511870 -0.023974 1.347475 1.456285 0.527407 1.082932 0.854974 1.955000 1.152255 0.201391
3 -0.768909 0.253732 -0.592687 -0.764464 3.283553 3.402909 1.915897 1.451707 2.867383 4.910919 ... -0.281464 0.133984 -0.249939 -0.550021 3.394275 3.893397 1.989588 2.175786 6.046041 4.935010
4 1.750297 -1.151816 1.776573 1.826229 0.280372 0.539340 1.371011 1.428493 -0.009560 -0.562450 ... 1.298575 -1.466770 1.338539 1.220724 0.220556 -0.313395 0.613179 0.729259 -0.868353 -0.397100
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
564 2.110995 0.721473 2.060786 2.343856 1.041842 0.219060 1.947285 2.320965 -0.312589 -0.931027 ... 1.901185 0.117700 1.752563 2.015301 0.378365 -0.273318 0.664512 1.629151 -1.360158 -0.709091
565 1.704854 2.085134 1.615931 1.723842 0.102458 -0.017833 0.693043 1.263669 -0.217664 -1.058611 ... 1.536720 2.047399 1.421940 1.494959 -0.691230 -0.394820 0.236573 0.733827 -0.531855 -0.973978
566 0.702284 2.045574 0.672676 0.577953 -0.840484 -0.038680 0.046588 0.105777 -0.809117 -0.895587 ... 0.561361 1.374854 0.579001 0.427906 -0.809587 0.350735 0.326767 0.414069 -1.104549 -0.318409
567 1.838341 2.336457 1.982524 1.735218 1.525767 3.272144 3.296944 2.658866 2.137194 1.043695 ... 1.961239 2.237926 2.303601 1.653171 1.430427 3.904848 3.197605 2.289985 1.919083 2.219635
568 -1.808401 1.221792 -1.814389 -1.347789 -3.112085 -1.150752 -1.114873 -1.261820 -0.820070 -0.561032 ... -1.410893 0.764190 -1.432735 -1.075813 -1.859019 -1.207552 -1.305831 -1.745063 -0.048138 -0.751207

569 rows × 30 columns

In [39]:
# using 2 PC as predictors
pca_2X = pcs_df.iloc[:, :2]
pca_2X.head()
Out[39]:
PC1 PC2
0 9.192837 1.948583
1 2.387802 -3.768172
2 5.733896 -1.075174
3 7.122953 10.275589
4 3.935302 -1.948072
In [40]:
### Tran/split dataset 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train_2pca, X_test_2pca, _, _ = train_test_split(pca_2X, y, test_size = 0.2, random_state = 42)
In [41]:
X_train.head()
Out[41]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
68 9.029 17.33 58.79 250.5 0.10660 0.14130 0.31300 0.04375 0.2111 0.08046 ... 10.31 22.65 65.50 324.7 0.14820 0.43650 1.25200 0.17500 0.4228 0.11750
181 21.090 26.57 142.70 1311.0 0.11410 0.28320 0.24870 0.14960 0.2395 0.07398 ... 26.68 33.48 176.50 2089.0 0.14910 0.75840 0.67800 0.29030 0.4098 0.12840
63 9.173 13.86 59.20 260.9 0.07721 0.08751 0.05988 0.02180 0.2341 0.06963 ... 10.01 19.23 65.59 310.1 0.09836 0.16780 0.13970 0.05087 0.3282 0.08490
248 10.650 25.22 68.01 347.0 0.09657 0.07234 0.02379 0.01615 0.1897 0.06329 ... 12.25 35.19 77.98 455.7 0.14990 0.13980 0.11250 0.06136 0.3409 0.08147
60 10.170 14.88 64.55 311.9 0.11340 0.08061 0.01084 0.01290 0.2743 0.06960 ... 11.02 17.45 69.86 368.6 0.12750 0.09866 0.02168 0.02579 0.3557 0.08020

5 rows × 30 columns

In [42]:
X_train_2pca.head()
Out[42]:
PC1 PC2
68 4.330003 9.202526
181 9.007166 0.581031
63 -2.314132 3.267990
248 -2.582556 0.729213
60 -2.385836 2.757658
In [43]:
## Training and Predicting

from sklearn.linear_model import LogisticRegression

logmodel_full = LogisticRegression(C=0.001)
logmodel_full.fit(X_train,y_train)
//anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[43]:
LogisticRegression(C=0.001)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(C=0.001)
In [44]:
logmodel_2pca = LogisticRegression(C=0.001)
logmodel_2pca.fit(X_train_2pca,y_train)
Out[44]:
LogisticRegression(C=0.001)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(C=0.001)
In [45]:
## Model evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

y_pred_full = logmodel_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)
auc_full = roc_auc_score(y_test, logmodel_full.predict_proba(X_test)[:, 1])
classification_report_full = classification_report(y_test, y_pred_full)

# Displaying the evaluation metrics
print("Evaluation Metrics for Logistic Regression with Complete Set of Features:")
print("Accuracy:", accuracy_full)
print("AUC Score:", auc_full)
print("Classification Report:")
print(classification_report_full)
Evaluation Metrics for Logistic Regression with Complete Set of Features:
Accuracy: 0.9649122807017544
AUC Score: 0.9990173599737963
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114

In [46]:
y_pred_2pca = logmodel_2pca.predict(X_test_2pca)
accuracy_2pca = accuracy_score(y_test, y_pred_2pca)
auc_2pca = roc_auc_score(y_test, logmodel_2pca.predict_proba(X_test_2pca)[:, 1])
classification_report_2pca = classification_report(y_test, y_pred_2pca)

# Displaying the evaluation metrics
print("\nEvaluation Metrics for Logistic Regression with First Two Principal Components:")
print("Accuracy:", accuracy_2pca)
print("AUC Score:", auc_2pca)
print("Classification Report:")
print(classification_report_2pca)
Evaluation Metrics for Logistic Regression with First Two Principal Components:
Accuracy: 0.8859649122807017
AUC Score: 0.9973796265967901
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.70      0.82        43
           1       0.85      1.00      0.92        71

    accuracy                           0.89       114
   macro avg       0.92      0.85      0.87       114
weighted avg       0.90      0.89      0.88       114

Compare the performance of the two models from the previous question.¶

In [47]:
eval_results = []

for name, y_pred_data in zip(['Original', 'PCA'], [y_pred_full, y_pred_2pca]):
    accuracy = accuracy_score(y_test, y_pred_data)
    auc = roc_auc_score(y_test, y_pred_data)
    classification_report_data = classification_report(y_test, y_pred_data, output_dict=True)
    
    precision_0 = classification_report_data['0']['precision']
    recall_0 = classification_report_data['0']['recall']
    f1_score_0 = classification_report_data['0']['f1-score']
    support_0 = classification_report_data['0']['support']
    
    precision_1 = classification_report_data['1']['precision']
    recall_1 = classification_report_data['1']['recall']
    f1_score_1 = classification_report_data['1']['f1-score']
    support_1 = classification_report_data['1']['support']
    
    eval_results.append({
        'Model': name,
        'Accuracy': accuracy,
        'AUC Score': auc,
        'Precision (Class 0)': precision_0,
        'Recall (Class 0)': recall_0,
        'F1-score (Class 0)': f1_score_0,
        'Support (Class 0)': support_0,
        'Precision (Class 1)': precision_1,
        'Recall (Class 1)': recall_1,
        'F1-score (Class 1)': f1_score_1,
        'Support (Class 1)': support_1
    })

# Create a DataFrame from the evaluation results
eval_df = pd.DataFrame(eval_results)

# Set index to indicate 'Original' and 'PCA'
eval_df.set_index('Model', inplace=True)

# Display the DataFrame
print(eval_df)
          Accuracy  AUC Score  Precision (Class 0)  Recall (Class 0)  \
Model                                                                  
Original  0.964912   0.958074              0.97561          0.930233   
PCA       0.885965   0.848837              1.00000          0.697674   

          F1-score (Class 0)  Support (Class 0)  Precision (Class 1)  \
Model                                                                  
Original            0.952381               43.0             0.958904   
PCA                 0.821918               43.0             0.845238   

          Recall (Class 1)  F1-score (Class 1)  Support (Class 1)  
Model                                                              
Original          0.985915            0.972222               71.0  
PCA               1.000000            0.916129               71.0  

Overall Performance:¶

  • Original Model: Outperforms the PCA model in most metrics. It demonstrates high accuracy (0.964912 ~ 0.885965), a good AUC score (0.958074), and strong precision and recall for both classes.

  • PCA Model shows significantly lower accuracy and AUC score. While it achieves perfect precision for Class 0, its recall for Class 0 is considerably lower. This suggests that it might be overly conservative in classifying instances as Class 0. On the other hand, it has perfect recall for Class 1, potentially due to being too lenient in classifying instances as Class 1. Reason: the data is unbalance.

  • Based on these metrics, the "Original" model appears to be the better choice overall. It demonstrates a more balanced and reliable performance across both classes. While the PCA model shows perfect precision for Class 0, its low recall suggests potential issues in identifying true Class 0 instances.

  • Advantage: However, in this dataset I propose to use PCA because instead of training 30 features in the original dataset, I just need to train 2 components but the result is still good. And the dataset has all numerical features (no categorical variables) with a lot of pairs with strong linear correlation relationship and potential multicolinearity (need to check VIF), so it's useful to use PCA. So in this case PCA will perform really well.

Run faster. Save computational resources.

  • Disadvantage: he PCA model seems to have an issue with recall for Class 0. This could be due to the fact that PCA reduces dimensionality, potentially discarding information relevant to classifying Class 0 instances correctly. The perfect precision for Class 0 might be an artifact of the model being overly cautious and underpredicting Class 0.

And we still need to collect all of the features from the original dataset, so it may require a lot of resources.

Answer: I still choose PCA with 2 components to train the data